NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

Kuang, Yilun; Amsel, Noah; Lotfi, Sanae; Qiu, Shikai; Potapczynski, Andres; Wilson, Andrew Gordon (July 2025, Proceedings of the 42nd International Conference on Machine Learning)

The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.
more » « less
Full Text Available
A nearby dark molecular cloud in the Local Bubble revealed via H2 fluorescence

https://doi.org/10.1038/s41550-025-02541-7

Burkhart, Blakesley; Dharmawardena, Thavisha E; Bialy, Shmuel; Haworth, Thomas J; Cruz_Aguirre, Fernando; Jo, Young-Soo; Andersson, B-G; Chung, Haeun; Edelstein, Jerry; Grenier, Isabelle; et al (April 2025, Nature Astronomy)

Full Text Available
On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification

Kapoor, Sanyam; Maddox, Wesley; Izmailov, Pavel; Wilson, Andrew Gordon (October 2022, NeurIPS)

Aleatoric uncertainty captures the inherent randomness of the data, such as measurement noise. In Bayesian regression, we often use a Gaussian observation model, where we control the level of aleatoric uncertainty with a noise variance parameter. By contrast, for Bayesian classification we use a categorical distribution with no mechanism to represent our beliefs about aleatoric uncertainty. Our work shows that explicitly accounting for aleatoric uncertainty significantly improves the performance of Bayesian neural networks. We note that many standard benchmarks, such as CIFAR-10, have essentially no aleatoric uncertainty. Moreover, we show that data augmentation in approximate inference softens the likelihood, leading to underconfidence and misrepresenting our beliefs about aleatoric uncertainty. Accordingly, we find that a cold posterior, tempered by a power greater than one, often more honestly reflects our beliefs about aleatoric uncertainty than no tempering --- providing an explicit link between data augmentation and cold posteriors. We further show that we can match or exceed the performance of posterior tempering by using a Dirichlet observation model, where we explicitly control the level of aleatoric uncertainty, without any need for tempering.
more » « less
Full Text Available
On Feature Learning in the Presence of Spurious Correlations

Izmailov, Pavel; Kirichenko, Polina; Gruver, Nate; Wilson, Andrew Gordon (October 2022, NeurIPS)

Deep classifiers are known to rely on spurious features — patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learned by standard empirical risk minimization (ERM) and specialized group robustness training. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is greatly affected by the design decisions beyond the training method, such as the model architecture and pre-training strategy. On the other hand, we find that strong regularization is not necessary for learning high-quality feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97\%, 92\% and 50\% worst-group accuracies, respectively.
more » « less
Full Text Available
Volatility Based Kernels and Moving Average Means for Accurate Forecasting with Gaussian Processes

Benton, Gregory; Maddox, Wesley; Wilson, Andrew Gordon (July 2022, Proceedings of the 39th International Conference on Machine Learning)

A broad class of stochastic volatility models are defined by systems of stochastic differential equations, and while these models have seen widespread success in domains such as finance and statistical climatology, they typically lack an ability to condition on historical data to produce a true posterior distribution. To address this fundamental limitation, we show how to re-cast a class of stochastic volatility models as a hierarchical Gaussian process (GP) model with specialized covariance functions. This GP model retains the inductive biases of the stochastic volatility model while providing the posterior predictive distribution given by GP inference. Within this framework, we take inspiration from well studied domains to introduce a new class of models, Volt and Magpie, that significantly outperform baselines in stock and wind speed forecasting, and naturally extend to the multitask setting.
more » « less
Full Text Available
Chroma-VAE: Mitigating Shortcut Learning with Generative Classifiers

Yang, Wanqian; Kirichenko, Polina; Goldblum, Micah; Wilson, Andrew Gordon (September 2022, NeurIPS)

Deep neural networks are susceptible to shortcut learning, using simple features to achieve low training loss without discovering essential semantic structure. Contrary to prior belief, we show that generative models alone are not sufficient to prevent shortcut learning, despite an incentive to recover a more comprehensive representation of the data than discriminative approaches. However, we observe that shortcuts are preferentially encoded with minimal information, a fact that generative models can exploit to mitigate shortcut learning. In particular, we propose Chroma-VAE, a two-pronged approach where a VAE classifier is initially trained to isolate the shortcut in a small latent subspace, allowing a secondary classifier to be trained on the complementary, shortcut-free latent subspace. In addition to demonstrating the efficacy of Chroma-VAE on benchmark and real-world shortcut learning tasks, our work highlights the potential for manipulating the latent space of generative classifiers to isolate or interpret specific correlations.
more » « less
Full Text Available
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations

https://doi.org/10.48550/arXiv.2204.02937

Kirichenko, Polina; Izmailov, Pavel; Wilson, Andrew Gordon (July 2022, International Conference on Machine Learning)

Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks, but with profoundly lower complexity and computational expenses. Moreover, we show that last layer retraining on large ImageNet-trained models can also significantly reduce reliance on background and texture information, improving robustness to covariate shift, after only minutes of training on a single GPU.
more » « less
Full Text Available
Low-Precision Stochastic Gradient Langevin Dynamics

Zhang, Ruqi; Wilson, Andrew Gordon; De Sa, Christopher (July 2022, Proceedings of Machine Learning Research)
Kamalika Chaudhuri, Stefanie Jegelka (Ed.)
While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance, due to its intrinsic ability to handle system noise. We prove that the convergence of low-precision SGLD with full-precision gradient accumulators is less affected by the quantization error than its SGD counterpart in the strongly convex setting. To further enable low-precision gradient accumulators, we develop a new quantization function for SGLD that preserves the variance in each update step. We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
more » « less
Full Text Available
Low-Precision Arithmetic for Fast Gaussian Processes

Maddox, Wesley J.; Potapczynski, Andres; Wilson, Andrew Gordon. (July 2022, UAI 2022.)

Low-precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory and energy requirements. However, despite its promise, low-precision arithmetic has received little attention for Gaussian processes (GPs), largely because GPs require sophisticated linear algebra routines that are unstable in low-precision. We study the different failure modes that can occur when training GPs in half precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low- precision over a wide range of settings, enabling GPs to train on 1.8 million data points in 10 hours on a single GPU, without any sparse approximations.
more » « less
Full Text Available
PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Lotfi, Sanae; Finzi, Marc Anton; Kapoor, Sanyam; Potapczynski, Andres; Goldblum, Micah; Wilson, Andrew Gordon (October 2022, NeurIPS)

While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam’s razor.
more » « less
Full Text Available

« Prev Next »

Search for: All records